Abstract

Handwritten historical document collections in libraries and other areas
are often of interest to researchers, students or the general public.
Convenient access to such corpora generally requires an index, which allows
one to locate individual text units (pages, sentences, lines) that are
relevant to a given query (usually provided as text). Several solutions are
possible: manual annotation (very expensive), handwriting recognition (poor
results) and word spotting, an image matching approach (computationally
expensive).

In this work, we present a novel retrieval approach for historical document
collections, which does not require recognition. We assume that word images
can be described using a vocabulary of discretized word features. From a
training set of labeled word images, we extract discrete feature vectors,
and estimate the joint probability distribution of features and word labels.
For a given feature vector (i.e. a word image), we can then calculate
conditional probabilities for all labels in the training vocabulary.
Experiments show that this relevance-based language model works very well,
with a mean average precision of 89% for 4-word queries on a subset of
George Washington's manuscripts.

1.  Introduction 

Libraries are in the transition from offering strictly paper-based material
to providing electronic versions of their collections. For simple access,
multimedia information, such as audio, video or images, requires an index
that allows one to retrieve data that is relevant to a given query (usually
provided as text).

At this time, historical manuscripts like the George Washington collection
at the Library of Congress are scanned page-by-page and then transcribed
manually in order to build an index from the electronic transcript. This
process is prohibitive for large collections, because of the extensive
manual labor that is involved. Automatic approaches using handwriting
recognition cannot be applied (see results in [17]), since the current
technology for recognizing handwriting from images, i.e. offline
recognition¹, has only been successful in domains with very limited lexicons
and/or high redundancy, such as legal amount processing on checks and
automatic mail sorting. An alternative approach called word spotting [15]
involves clustering multiple instances of the same word image using image
matching. Frequent clusters can then be used as index entries, because the
contained images have links to the original pages. This technique is
expensive (it requires $O(N^2)$ matching operations for $N$ word images)
and does not easily allow for text queries.

Here we present an approach to retrieving handwritten historical documents
from a single author, using a relevance-based language model [10, 11].
Relevance models have been successfully used for both retrieval and
cross-language retrieval of text documents and more recently for image
annotation [9]. In their original form, these models capture the joint
statistical occurrence pattern of words in two languages, which are used to
describe a certain domain (e.g. a news event). By learning this dependency,
one can identify texts of interest, i.e. relevant documents, in a foreign
language by describing their content in a familiar language.

This paradigm can be used for the image domain, by describing images with
words from a feature vocabulary, thus generating an “image description
language”. When the joint statistical occurrence pattern of words occurring
in the image vocabulary and the image annotation vocabulary (i.e. word
labels) is learned, one can perform tasks such as image retrieval using text
queries, or automatic image annotation.

In this work, we model the occurrence pattern of words in two languages
using the joint probability distribution over the image description
vocabulary and the annotation vocabulary. From a training set of annotated
images of handwritten words, we learn this joint probability distribution
and perform retrieval experiments with text queries on a test set. We
describe word images using a vocabulary that is derived from a set of word
shape features.

¹ Online recognition, which records the pen position, etc. during writing,
has been much more successful (see TabletPCs).

Our model differs from others in a number of respects. Unlike traditional
handwriting recognition paradigms [12], our approach does not require
perfect recognition for good retrieval. The work presented here is also
related to models used for object recognition/image annotation and retrieval
[6, 1, 3, 9]. However, those approaches were proposed for
annotating/retrieving general-purpose colored photographs and primarily used
color and texture as features. Here our focus is on word images, where such
features are not available. Instead we use shape features to retrieve
images. This model is not limited to handwritten document retrieval, but can
be extended to many shape-related retrieval and annotation tasks in computer
vision.

Using this relevance-based language model, we have conducted retrieval
experiments on a set of 20 pages from the George Washington collection. The
mean average precision scores we achieve lie in the range from 54% to 89%
for queries using 1 to 4 words (respectively).

In the following section we discuss prior work in the field, followed by a
detailed description of the relevance-based model in section 2. After
explaining the features we use in our approach (section 3), we present
line-retrieval results on the George Washington collection (section 4).
Section 5 concludes the paper with an outlook on further work.

1.1.  Previous  Work 

There are a number of approaches reported in the literature, which model the
statistical co-occurrence patterns of image features and annotation words,
in order to perform such diverse tasks as image annotation, object
recognition and image retrieval. Mori et al. [13] estimate the likelihood of
annotation terms appearing in a given image, by modeling the co-occurrence
relationship between clustered feature vectors and annotation terms. Duygulu
et al. [6] go one step further by actually annotating individual image
regions (rather than producing sets of keywords for an image), which is in
effect object class recognition. Their model uses the
Expectation-Maximization (EM) algorithm to build a probability table that
links “blobs” (clusters of image representations) to annotation terms.
Barnard and Forsyth [1] extended Hofmann’s Hierarchical Aspect Model for
text and proposed a multi-modal approach to hierarchical clustering of
images and words using EM. Blei and Jordan [3] extended their Latent
Dirichlet Allocation (LDA) Model and proposed a Correlation LDA model, which
relates words and images. They show only a few examples for labeling
specific regions in an image, so it is difficult to tell how well this
technique works.

The authors of [9] introduced the model used in this work for automatic
image annotation and retrieval. With the same data and feature set, the
results for image annotation were dramatically better than previous models,
for example twice as good as the translation model [6]. This work extends
that model to a different domain (word images in a noisy document
environment), and uses an improved feature representation and different
attributes (shape). Shape has to be described by features that are very
different from the previously utilized color and texture features. We test
the model on a data set with a larger annotation vocabulary than previous
experiments and a feature vector discretization that preserves more detail
than the clustering algorithms which are utilized in other approaches. In
addition, our application (line retrieval) uses a new retrieval model
formulation. Other authors have previously suggested document-retrieval
systems that do not require recognition, but queries have to be issued in
the form of examples in the image domain (e.g. see [16]). To our knowledge,
our system is the first to allow retrieval without recognition using text
queries.

All of the image-to-word translation approaches we are aware of operate on
image collections of good quality (e.g. the Corel image database [6, 9]),
which usually contain color and texture information. Color is known to be
one of the most useful features for describing objects. Duygulu et al. [6],
for example, use half of the entries in their feature vectors for color
information. Images of handwritten words, on the other hand, do not
generally contain color or texture information, and in the case of
historical documents, the image quality is often greatly reduced.

The lack of other features makes shape a typical choice for offline
handwriting recognition approaches. We make use of holistic word shape
features that are justified by psychological studies of human reading [12],
and which are widely used in the field [5, 15, 18]. However, these
traditional features have varying sizes proportional to the length of the
words and also tend to capture variations in word images, which are not
always desirable. In order to capture the essential word shape and to get
feature vectors of constant size, we use the low order DFT coefficients [7]
of these features to represent a word image.

2.  Model  Formulation 

Before explaining our model in detail, we would like to provide some
intuition for it. Previous research in cross-lingual information retrieval
has shown that co-occurrence probabilities of words in two languages (e.g.
English and Chinese) can be effectively estimated from a parallel corpus,
that is, a collection of document pairs, where each document is available in
two languages. Reliable estimates can be achieved even without any knowledge
of the involved languages.

Reference [11] describes how to capture the co-occurrence probabilities of
an English word $e$ and Chinese words $c_i$ in the joint probability
distribution $P(e, c_1 \ldots c_k)$. By computing the conditional
distribution

$$P(e \mid c_1 \ldots c_k) = \frac{P(e, c_1 \ldots c_k)}{P(c_1 \ldots c_k)}$$

one can estimate the probability of occurrence of the term $e$ in an English
document given the occurrence of the terms $c_i$ in a Chinese document
(which talks about the same subject).

Here we apply this concept by describing images of words with an image
description language in text. To do this, we extract features from the
images and discretize them, which allows us to represent each word image in
terms of a discrete vocabulary. From a set of labeled images of words we can
then estimate the joint probability $P(w, f_1 \ldots f_k)$, where $w$ is a
word label (the word “transcription”) and the $f_i$ are words from the image
description language. Using the conditional density
$P(w \mid f_1 \ldots f_k)$ we can perform retrieval of handwritten text
without recognition with high accuracy.

2.1.  Model  Estimation 

Suppose we have a collection $C$ of annotated manuscripts. We will model
this collection as a sequence of random variables $W_i$, one for each word
position $i$ in $C$. Each variable $W_i$ takes on a dual representation:
$W_i = \{h_i, w_i\}$, where $h_i$ is the image of the handwritten form at
position $i$ in the collection and $w_i$ is the corresponding transcription
of the word. As we describe in the following section, we will represent the
surface form $h_i$ as a set of discrete features $f_{i,1} \ldots f_{i,k}$
from some feature “vocabulary” $H$. The transcription $w_i$ is simply a word
from the English vocabulary $V$. Consequently, each random variable $W_i$
takes values of the form $\{w_i, f_{i,1} \ldots f_{i,k}\}$. In the remaining
portions of this section we will discuss how we can estimate a probability
distribution over the variables $W_i$.


We assume that for each position $i$ in the collection there exists an
underlying multinomial probability distribution $P_i(\cdot)$ over the union
of the vocabularies $V$ and $H$. We further assume that the actual values
$\{w_i, f_{i,1} \ldots f_{i,k}\}$ observed at position $i$ represent an
i.i.d. random sample drawn from $P_i(\cdot)$. In other words, the
probability of a particular observation is given by:

$$P(W_i = w_i, f_{i,1} \ldots f_{i,k} \mid I_i) = P_i(w_i \mid I_i) \prod_{j=1}^{k} P_i(f_{i,j} \mid I_i) \qquad (1)$$

where $I_i$ is the word image with representation $W_i$ (we will omit the
conditioning on $I_i$ in the further derivations). Now suppose we are given
an arbitrary observation $W = \{w, f_1 \ldots f_k\}$, and would like to
compute the probability of that observation appearing as a random sample
somewhere in our corpus $C$. Because the observation is not tied to any
position, we have to estimate the probability as the expectation over every
position $i$ in our entire collection $C$:

$$P(w, f_1 \ldots f_k) = E_i\left[P(W_i = w, f_1 \ldots f_k)\right] = \frac{1}{|C|} \sum_{i=1}^{|C|} P_i(w) \prod_{j=1}^{k} P_i(f_j) \qquad (2)$$

Here $|C|$ denotes the aggregate number of word positions in the collection.
Equation (2) gives us a powerful formalism for performing automatic
annotation and retrieval over handwritten documents.

2.2.  Automatic Annotation and Retrieval of Manuscripts

Suppose we are given a training collection $C$ of annotated manuscripts, and
a target collection $T$ where no annotations are provided. Given an
arbitrary handwritten image $h$ we can automatically compute its image
vocabulary (feature) representation $f_1 \ldots f_k$ and then use equation
(2) to predict the words $w$ which are likely to occur jointly with the
features of $h$. These predictions would take the form of a conditional
probability:

$$P(w \mid f_1 \ldots f_k) = \frac{P(w, f_1 \ldots f_k)}{\sum_{v \in V} P(v, f_1 \ldots f_k)} \qquad (3)$$

This probability could be used directly to annotate new handwritten images
with highly probable words. We provide a brief evaluation for this kind of
annotation in section 4.2. However, if we are interested in retrieving
sections of manuscripts we can make another use of equation (3).


Suppose we are given a user query $Q = q_1 \ldots q_m$. We would like to
retrieve sections $S \subset T$ of the target collection that contain the
query words. More generally, we would like to rank the sections $S$ by the
probability that they are relevant to $Q$. One of the most effective methods
for ranked retrieval is based on the statistical language modeling framework
[14]. In this framework, sections $S$ of text are ranked by the probability
that the query $Q$ would be observed during i.i.d. random sampling of words
from $S$:

$$P(Q \mid S) = \prod_{j=1}^{m} P(q_j \mid S) \qquad (4)$$

In text retrieval, estimating the probability $P(q_j \mid S)$ is
straightforward: we just count how many times the word $q_j$ actually
occurred in $S$, and then normalize and smooth the counts. When we are
dealing with handwritten documents we do not know what words did or did not
occur in a given section of text. However, we can use the conditional
estimate provided by equation (3):

$$P(q_j \mid S) = \frac{1}{|S|} \sum_{o=1}^{|S|} P(q_j \mid f_{o,1} \ldots f_{o,k}) \qquad (5)$$

Here $|S|$ refers to the number of word images in $S$, the index $o$ goes
over all positions in $S$, and $f_{o,1} \ldots f_{o,k}$ represent a set of
features derived from the word image in position $o$. Combining equations
(4) and (5) provides us with a complete system for handwriting retrieval.
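
The ranking defined by equations (4) and (5) can be illustrated with a short Python sketch. It is only a minimal illustration: `cond_prob` is a hypothetical stand-in for the conditional estimate of equation (3), and the names `query_likelihood` and `rank_sections` as well as the toy probabilities are assumptions made for this example.

```python
from math import prod


def query_likelihood(query, section, cond_prob):
    """P(Q | S): query is a list of words, section a list of feature lists
    (one per word image). cond_prob(word, feats) is a hypothetical stand-in
    for P(w | f_1 ... f_k) from equation (3)."""
    def p_word(q):
        # Equation (5): average the conditional estimate over all positions.
        return sum(cond_prob(q, feats) for feats in section) / len(section)

    # Equation (4): the query words are treated as independent samples.
    return prod(p_word(q) for q in query)


def rank_sections(query, sections, cond_prob):
    """Return section ids sorted by decreasing P(Q | S)."""
    return sorted(
        sections,
        key=lambda sid: query_likelihood(query, sections[sid], cond_prob),
        reverse=True,
    )


# Toy conditional probabilities, invented purely for illustration.
toy = {("fort", "x"): 0.9, ("fort", "y"): 0.1,
       ("men", "x"): 0.1, ("men", "y"): 0.9}
cond = lambda word, feats: toy[(word, feats[0])]
sections = {"line1": [["x"], ["y"]], "line2": [["y"], ["y"]]}
ranking = rank_sections(["fort", "men"], sections, cond)
```

Because equation (5) averages over positions, a line scores highly only if every query word is probable for at least one of its word images, which is why `line1` ranks above `line2` in the toy data.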

2.3.  Estimation  Details 

In this section we provide the estimation details necessary for a successful
implementation of our model. In order to use equation (2) we need estimates
for the multinomial models $P_i(\cdot)$ that underlie every position $i$ in
the training collection $C$. We estimate these probabilities via smoothed
relative frequencies:

$$P_i(x) = \frac{\lambda}{1+k}\,\delta(x \in \{w_i, f_{i,1} \ldots f_{i,k}\}) + \frac{1-\lambda}{(1+k)|C|} \sum_{i'=1}^{|C|} \delta(x \in \{w_{i'}, f_{i',1} \ldots f_{i',k}\}) \qquad (6)$$

where $\delta(x \in \{w, f_1 \ldots f_k\})$ is a set membership function,
equal to one if and only if $x$ is either $w$ or one of the feature
vocabulary terms $f_1 \ldots f_k$. Parameter $\lambda$ controls the degree
of smoothing on the frequency estimate and can be tuned empirically.
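
As a rough illustration of how the estimates fit together, the following Python sketch reads equation (6) as a linear interpolation between the relative frequency at one position and the relative frequency in the whole collection, and then evaluates equations (2) and (3). The function names, the toy training data and the value of the smoothing parameter are assumptions for this example; the sketch also assumes that word labels and feature terms never share a string.

```python
from collections import Counter
from math import prod


def train(positions, lam=0.5):
    """positions: list of (word, [feature terms]) pairs, each with k terms.
    Returns p_i(i, x), a smoothed estimate in the spirit of equation (6)."""
    k = len(positions[0][1])
    n = len(positions)
    # Collection-wide counts: number of positions whose observation contains x.
    background = Counter()
    for word, feats in positions:
        background.update({word, *feats})

    def p_i(i, x):
        word, feats = positions[i]
        local = 1.0 if (x == word or x in feats) else 0.0
        return lam * local / (1 + k) + (1 - lam) * background[x] / ((1 + k) * n)

    return p_i


def joint(p_i, n, word, feats):
    """Equation (2): expectation over all n training positions."""
    return sum(p_i(i, word) * prod(p_i(i, f) for f in feats)
               for i in range(n)) / n


def annotate(p_i, n, vocab, feats):
    """Equation (3): normalise the joint over the annotation vocabulary."""
    scores = {v: joint(p_i, n, v, feats) for v in vocab}
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}


# Toy training data, invented for illustration only.
training = [("fort", ["a1", "b2"]), ("men", ["a3", "b1"]), ("fort", ["a1", "b1"])]
p_i = train(training, lam=0.8)
posterior = annotate(p_i, len(training), {"fort", "men"}, ["a1", "b2"])
```

In this toy setting the features `a1` and `b2` co-occur with the label "fort" in training, so "fort" receives most of the posterior mass.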


3.  Word Image Features

The mathematical formulation of our retrieval approach requires that word
images are represented in terms of a feature vocabulary with discrete
entries. This is achieved in a four-step process (see Figure 1 for an
illustration):

1. extract single-valued/scalar features (e.g. image width) and
variable-length features (e.g. projection profile) from the word image.

2. compute a fixed-length description of the variable-length features by
using low-order Fourier coefficients.

3. combine scalar features and Fourier coefficients into a fixed-length
feature vector.

4. discretize each feature dimension using a binning scheme, and produce one
vocabulary term per bin.

In the following sections, the word image features are described, followed
by an explanation of the vocabulary generation (i.e. feature discretization)
process. These steps require image normalization which removes some of the
variability that is present even in single-author handwriting. Figure 2
shows the results of background cleaning and slant/skew/baseline-correction
on a typical input image.

Figure 2: Image cleaning and normalization. (a) original image, as segmented
from document, (b) after cleaning and normalization.

3.1.  Scalar Features

Each of the features described here can be expressed by a scalar (a single
number). Some of them have been used previously (see e.g. [15]) to quickly
determine coarse similarity between word images. For a given image with
tight bounding box (no extra space around word) we extract:

1. the height h,

2. the width w,

3. the aspect ratio w/h,

4. the area w · h, and

5. an estimate for the number of descenders in the word, i.e. strokes below
the baseline (e.g. lower part of ’p’).



3.2.  Variable-Length  Features 

The variable-length features we use give a much more detailed view of a
word’s shape than single-valued features can. All of the time series
features below have been successfully used in a whole-word matching approach
[15]. Each feature results from recording a single number per image column
in the word image, thus creating a time series of the same length as the
width of the image.

We generate three time series:

1. Projection Profile: each time series value is the sum of the pixel
intensities in the corresponding image column (see Figure 3(a) for an
example).

2. Upper Word Profile: each value is the distance from the top of the word’s
bounding box to the first “ink” pixel in the corresponding image column (see
Figure 3(b)).

3. Lower Word Profile: same as upper word profile, but distance is measured
from bottom of image bounding box.

Figure 1: Feature generation process (continuous-space to discrete-space).

Figure 3: Two of the three utilized time series features, plotted against
image column ("time"): (a) projection profile time series, (b) upper word
profile time series. Both features were directly extracted from image 2(b).

All of these features are normalized so that their maximum range is [0..1].
This ensures that features are comparable across words of different heights.
The quality of these features strongly depends on good image normalization.
For example, slant can affect the visibility of parts of words in terms of
the word profile features (e.g. the ’l’ leaning over the ’e’ in Figure
2(a)).
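
The three column-wise time series can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the image is a binary matrix (1 = ink) rather than a grayscale image, columns without ink receive the full image height as profile distance, and all values are divided by the image height to land in [0..1]. The function name `column_profiles` is hypothetical.

```python
def column_profiles(img):
    """img: list of rows of 0/1 values (1 = ink), cropped to a tight bounding
    box. Returns (projection, upper, lower), one value per image column."""
    height, width = len(img), len(img[0])
    projection, upper, lower = [], [], []
    for c in range(width):
        column = [img[r][c] for r in range(height)]
        ink_rows = [r for r, value in enumerate(column) if value]
        # Projection profile: amount of ink in this column.
        projection.append(sum(column) / height)
        # Upper profile: distance from the top edge to the first ink pixel.
        # Assumption: empty columns get the full height as distance.
        upper.append((ink_rows[0] if ink_rows else height) / height)
        # Lower profile: distance from the bottom edge to the last ink pixel.
        lower.append((height - 1 - ink_rows[-1] if ink_rows else height) / height)
    return projection, upper, lower


tiny = [[0, 1, 0],
        [1, 1, 0],
        [0, 1, 0]]
projection, upper, lower = column_profiles(tiny)
```

Each returned list has one entry per image column, so its length equals the image width, which is exactly why a fixed-length summary is needed before these features can enter the model.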

While these time series features capture the shape of a word in great
detail, they vary in length, and thus cannot be used in our framework, which
requires fixed-length feature vectors. A time series can be adequately
approximated by the lower-order coefficients of its Discrete Fourier
Transform (DFT) [7]. The DFT representation also takes into account that
images can have different lengths, since one period of the DFT basis
functions is equal to the number of sample points.

We perform the DFT on the time series $s = s_0 \ldots s_{n-1}$ to get its
frequency-space representation $S = S_0 \ldots S_{n-1}$:

$$S_k = \sum_{l=0}^{n-1} s_l \cdot e^{-2\pi i l k / n}, \quad 0 \le k \le n-1. \qquad (7)$$

From the DFT representation we extract the first 4 real (cosine) components
and 3 imaginary (sine) components² for use as single-valued features. Figure
4 shows a reproduction of the time series in Figure 3(a) using these
features. For our purposes, this approximation suffices, since the goal is
not to represent the original signal in detail, but rather to capture the
global word shape with a small number of descriptors.
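
A direct, unoptimized evaluation of equation (7) shows how seven descriptors are obtained from a time series of any length. The sketch below assumes that the "first 4 real" components are the real parts of S_0 to S_3 and the "3 imaginary" components are the imaginary parts of S_1 to S_3 (the imaginary part of S_0 is zero for real input); the function names are hypothetical and no additional scaling is applied.

```python
import cmath


def dft_coefficient(series, k):
    """S_k as defined in equation (7), computed directly from the sum."""
    n = len(series)
    return sum(s * cmath.exp(-2j * cmath.pi * l * k / n)
               for l, s in enumerate(series))


def dft_features(series, n_cos=4, n_sin=3):
    """Fixed-length description of a variable-length time series."""
    coeffs = [dft_coefficient(series, k) for k in range(max(n_cos, n_sin + 1))]
    cosines = [c.real for c in coeffs[:n_cos]]       # real parts of S_0 .. S_3
    sines = [c.imag for c in coeffs[1:n_sin + 1]]    # imaginary parts of S_1 .. S_3
    return cosines + sines


impulse = dft_features([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
flat = dft_features([2.0] * 8)
```

An impulse at the first sample has a flat spectrum (all cosine terms equal to 1), while a constant series concentrates everything in S_0; both inputs yield seven numbers even though their lengths differ.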

² For real-valued signals, the first imaginary component of the DFT is
always 0.

Figure 4: Projection profile time series from Figure 3(a), reconstructed
using 4 lowest-order DFT coefficients (horizontal axis: image column
("time")).

3.3.  Discretizing  Features  /  Vocabulary 

With all features combined, we have a continuous-space vector with
5 + 3 · (4 + 3) = 26 entries. Our relevance model requires us to represent
all feature vectors in terms of a fixed-size “feature vocabulary”. This can
be achieved by discretizing each entry of the feature vector and creating
one vocabulary term per discretization step. Then the vocabulary
representation of a feature vector is comprised of the terms that correspond
to the discretization steps of each vector entry.

We chose a discretization strategy that divides the observed range of each
feature dimension in the training set into 10 parts (bins) of equal size.
Since similar feature values could end up in neighboring bins if they fall
into the region where two bins meet, we use a second set of 9 bins with
shifted bin centers. Figure 5 illustrates this idea. This discretization
process uses two vocabulary terms (e.g. feature12_binset1_bin5 and
feature12_binset2_bin4) to represent a feature vector entry. Per word image,
this results in a representation that uses 26 · 2 = 52 feature vocabulary
terms.
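
The two-bin-set discretization can be sketched as follows. Several details are assumptions for this example: the second bin set is taken to be shifted by half a bin width, values in the outer half-bins are clamped to the nearest bin of the second set, and `lo`/`hi` stand for the feature range observed in training (with `hi > lo`). The function names are hypothetical; the term format follows the examples in the text.

```python
def bin_terms(value, lo, hi, feature_id, n_bins=10):
    """Map one feature value to two vocabulary terms, one per bin set."""
    width = (hi - lo) / n_bins

    def clamp(index, n):
        return min(max(index, 1), n)

    # Bin set 1: n_bins equal-width bins covering [lo, hi].
    b1 = clamp(int((value - lo) // width) + 1, n_bins)
    # Bin set 2: n_bins - 1 bins, shifted by half a bin width (assumption).
    b2 = clamp(int((value - lo - width / 2) // width) + 1, n_bins - 1)
    return [f"feature{feature_id}_binset1_bin{b1}",
            f"feature{feature_id}_binset2_bin{b2}"]


def describe(vector, ranges):
    """Vocabulary representation of a whole feature vector."""
    terms = []
    for idx, (value, (lo, hi)) in enumerate(zip(vector, ranges), start=1):
        terms.extend(bin_terms(value, lo, hi, idx))
    return terms


example = bin_terms(0.42, 0.0, 1.0, 12)
word_terms = describe([0.5] * 26, [(0.0, 1.0)] * 26)
```

With a range of [0, 1], the value 0.42 falls into bin 5 of the first set and bin 4 of the second, which reproduces the pair of terms quoted above, and a 26-entry vector yields 52 terms.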

4.  Experimental  Evaluation 

In this section we discuss the experiments we carried out to evaluate the
proposed retrieval model. We will discuss two types of evaluation. First, we
briefly look at the predictive capability of the annotation as outlined in
section 2. We train a model on a small set of annotated manuscripts and
evaluate how well the model was able to annotate each word in a held-out
portion of the dataset. Then we turn to evaluating the model in the context
of ranked retrieval.

The data set we used in training and evaluating our approach consists of 20
manually annotated pages from George Washington’s handwritten letters.
Segmenting this collection yielded a total of 4773 images, of which the
majority contain exactly one word. An estimated 5-10% of the images contain
segmentation errors of varying degrees: parts of words that have faded tend
to get missed by the segmentation, and occasionally images contain 2 or more
words or only a word fragment.

4.1.  Evaluation  Methodology 

Our dataset comprises 4773 total word occurrences arranged on 657 lines.
Because of the relatively small size of the dataset, all of our experiments
use a 10-fold randomized cross-validation, where each time the data is split
into a 90% training set and a 10% testing set. Splitting was performed on a
line level, since we chose lines to be our retrieval unit. Prior to any
experiments, the manual annotations were reduced to the root form using the
Krovetz morphological analyzer. This is a standard practice in Information
Retrieval; it allows one to search for semantically similar variants of the
same word. For our annotation experiments we use every word of the 4773-word
vocabulary that occurs in both the training and the testing set. For
retrieval experiments, we remove all function words, such as “of”, “the”,
“and”, etc. Furthermore, to simulate real queries users might pose to our
system, we tested all possible combinations of 2, 3 and 4 words that
occurred on the same line in the testing set, but not necessarily in the
training set. Function words were excluded from all of these combinations.
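
The multi-word query sets described above could be generated along the following lines. This sketch makes two assumptions that the text does not spell out: a query is an unordered set of distinct words, and the stop list `FUNCTION_WORDS` shown here is a tiny illustrative sample rather than the list actually used. The function name `line_queries` is hypothetical.

```python
from itertools import combinations

# Illustrative stop list only; the real list of function words is not given.
FUNCTION_WORDS = {"of", "the", "and"}


def line_queries(line_words, size):
    """All unordered combinations of `size` distinct content words on a line."""
    content = sorted({w for w in line_words if w not in FUNCTION_WORDS})
    return list(combinations(content, size))


pairs = line_queries(["men", "of", "the", "virginia", "regiment"], 2)
quads = line_queries(["men", "of", "the", "virginia", "regiment"], 4)
```

A line with three content words yields three 2-word queries and no 4-word query, which is consistent with the number of available queries shrinking as the query length grows.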

We use the standard evaluation methodology of Information Retrieval. In
response to a given query, our model produces a ranking of all lines in the
testing set. Out of these lines we consider only the ones that contain all
query words to be relevant. The remaining lines are assumed to be
non-relevant. Then for each line in the ranked list we compute recall and
precision. Recall is defined as the number of relevant lines above (and
including) the current line, divided by the total number of relevant lines
for the current query. Similarly, precision is defined as the number of
relevant lines above (and including) the current line, divided by the rank
of the current line. Recall is a measure of what percent of relevant lines
we found, and precision suggests how many non-relevant lines we had to look
at to achieve that recall. In our evaluation we use plots of precision vs.
recall, averaged over all queries and all cross-validation repeats. We also
report Mean Average Precision, which is an average of precision values at
all recall points.
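
These measures can be made concrete with a small Python sketch. It assumes the common reading of average precision, in which precision is averaged over the ranks at which relevant lines appear; the function names and the toy ranking are invented for illustration.

```python
def precision_recall(ranked, relevant):
    """(recall, precision) after each rank of a ranked list of line ids."""
    points, hits = [], 0
    for rank, line in enumerate(ranked, start=1):
        if line in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant lines."""
    hits, total = 0, 0.0
    for rank, line in enumerate(ranked, start=1):
        if line in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
ap = average_precision(ranked, relevant)
curve = precision_recall(ranked, relevant)
```

For the toy ranking, the relevant lines sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2, i.e. about 0.83.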

4.2.  Discussion  of  Results 

Figure 6 shows the performance of our model on the task of assigning word
labels to handwritten images. We carried out two types of evaluation. In
position-level evaluation, we generated a probability distribution
$P(w \mid f_{i,1} \ldots f_{i,k})$ for every position $i$ in the testing
set. Then we looked for the rank of the correct word $w$ in that
distribution and averaged the resulting recall and precision over all
positions. Since we did not exclude function words at this stage,
position-level evaluation is strongly biased toward very common words such
as “of”, “the” etc. These words are generally not very interesting, so we
carried out a word-level evaluation. Here for a given word $w$ we look at
the ranked list of all the positions $i$ in the testing set, sorted in the
decreasing order of $P(w \mid f_{i,1} \ldots f_{i,k})$. This is similar to
running $w$ as a query and retrieving all positions in which it could
possibly occur. Recall and precision were calculated as discussed in the
previous section.

From the graphs in Figure 6 we observe that our model performs quite well in
annotation. For position-level annotation, we achieve 50% precision at rank
1, which means that for a given position $i$, half the time the word $w$
with the highest conditional probability $P(w \mid f_{i,1} \ldots f_{i,k})$
is the correct one. Word-oriented evaluation also has close to 50% precision
at rank 1, meaning that for a given word $w$ the highest-ranked position $i$
contains that word almost half the time. Mean Average Precision values are
54% and 52% for position-oriented and word-oriented evaluations
respectively.

Figure 5: Binning scheme used in discretizing feature values (shown for one
feature dimension). The feature value range observed in training data is
covered by bins 1 to 10 (bin set 1) and by bins 1 to 9 (bin set 2).

Figure 6: Performance on annotating word images with words (plot title:
"Quality of Automatic Annotation"; horizontal axis: recall).

Figure 7: Performance on ranked retrieval with different query sizes (plot
title: "Quality of Ranked Retrieval").


Now we turn our attention to using our model for the task of retrieving
relevant portions of manuscripts. As discussed before, we created four sets
of queries: 1, 2, 3 and 4 words in length, and tested them on retrieving
line segments. Our experiments involve a total of 1950 single-word queries,
1939 word pairs, 1870 3-word and 1558 4-word queries over 657 lines. Figure
7 shows the recall-precision graphs. It is very encouraging to see that our
model performs extremely well in this evaluation, reaching over 90% mean
precision at rank 1. This is an exceptionally good result, showing that our
model is nearly flawless when even such short queries are used. Mean average
precision values were 54%, 63%, 78% and 89% for 1-, 2-, 3- and 4-word
queries respectively. Figures 8, 9 and 10 show three retrieval results (two
good and one bad) with variable-length queries. We have implemented a demo
web-interface for our retrieval system, which can be found at URL here!.

Figure 8: Retrieval result for the 4-word query “sergeant wilper fort
Cumberland” (one relevant line in collection).

Figure 9: Retrieval result for the 3-word query “men Virginia regiment” (one
relevant line in collection).

Figure 10: Retrieval result for the 1-word query “sergeant” (three relevant
lines in collection).

5.  Summary and Conclusion

We have presented a relevance-based language model for the retrieval of
handwritten documents. Our model estimates the joint probability of
occurrence of word annotations and feature vocabulary terms in order to
perform probabilistic annotation of whole words and retrieval of lines of
handwritten text. Our approach is the first to use shape-based features, and
we presented appropriate shape representation, discretization and retrieval
techniques. The results for line retrieval indicate performance at a level
that is practical for real-world applications.

Future  work  will  include  a  retrieval  system  for  a  larger 
collection,  with  page  retrieval.  Extending  the  collection 
could  require  more  features  in  order  to  discriminate  better 
between  similar  words.  Lastly,  we  would  also  like  to  work 
on  new  retrieval  models. 